Computer-Based Authorship Attribution Without Lexical Measures

نویسندگان

  • Efstathios Stamatatos
  • Nikos Fakotakis
  • George K. Kokkinakis
چکیده

The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Lost in Translation: Authorship Attribution using Frame Semantics

We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attri...

متن کامل

Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution

Traditional Authorship Attribution models extract normalized counts of lexical elements such as nouns, common words and punctuation and use these normalized counts or ratios as features for author fingerprinting. The text is viewed as a “bag-of-words” and the order of words and their position relative to other words is largely ignored. We propose a new method of feature extraction which quantif...

متن کامل

Style-Markers in Authorship Attribution A Cross-Language Study of the Authorial Fingerprint

Th e present study addresses one of the theoretical problems of computer-assisted authorship attribution, namely the question which traceable features of language can betray authorial uniqueness (a stylistic fi ngerprint) of literary texts. A number of recent approaches show that apart from lexical measures — especially those relying on the frequencies of the most frequent words — also some oth...

متن کامل

Explaining Delta, or: How do distance measures for authorship attribution work?

Authorship Attribution is a research area in quantitative text analysis concerned with attributing texts of unknown or disputed authorship to their actual author based on quantitatively measured linguistic evidence (see Juola 2006; Stamatatos 2009; Koppel et al. 2009). Authorship attribution has applications in literary studies, history, forensics and many other fields, e.g. corpus stylistics (...

متن کامل

Source Code Authorship Attribution Using Long Short-Term Memory Based Networks

Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST)...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computers and the Humanities

دوره 35  شماره 

صفحات  -

تاریخ انتشار 2001